
Clustering transformed compositional data using K-means, with applications in gene expression and bicycle sharing system data


Abstract

Although there is no shortage of clustering algorithms proposed in the literature, the question of the most relevant strategy for clustering compositional data (i.e., data made up of profiles, whose rows belong to the simplex) remains largely unexplored in cases where the observed value is equal or close to zero for one or more samples. This work is motivated by the analysis of two sets of compositional data, both focused on the categorization of profiles but arising from considerably different applications: (1) identifying groups of co-expressed genes from high-throughput RNA sequencing data, in which a given gene may be completely silent in one or more experimental conditions; and (2) finding patterns in the usage of stations over the course of one week in the Velib' bicycle sharing system in Paris, France. For both of these applications, we focus on the use of appropriately chosen data transformations, including the Centered Log Ratio and a novel extension we propose called the Log Centered Log Ratio, in conjunction with the K-means algorithm. We use a nonasymptotic penalized criterion, whose penalty is calibrated with the slope heuristics, to select the number of clusters present in the data. Finally, we illustrate the performance of this clustering strategy, which is implemented in the Bioconductor package coseq, on both the gene expression and bicycle sharing system data.
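As a rough illustration of the general idea described in the abstract (transforming profiles with the Centered Log Ratio and then running K-means), a minimal sketch is given below. It is not the coseq implementation. The function names `clr` and `kmeans`, the pseudocount used to keep zeros out of the logarithm, the toy counts, and the plain Lloyd-style K-means loop are all assumptions made for illustration. The Log Centered Log Ratio and the slope-heuristics selection of the number of clusters are not shown, since the abstract does not define them.

```python
import numpy as np


def clr(profiles, pseudocount=1e-6):
    """Centered log ratio of each row: log(x_j) minus the row mean of log(x).

    The pseudocount is an assumed, ad hoc way to avoid log(0); it is not
    taken from the paper.
    """
    x = np.asarray(profiles, dtype=float) + pseudocount
    x = x / x.sum(axis=1, keepdims=True)  # rescale rows so they sum to 1 again
    log_x = np.log(x)
    return log_x - log_x.mean(axis=1, keepdims=True)


def kmeans(data, k, n_iter=100, seed=0):
    """Plain Lloyd-style K-means on the rows of `data` (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random.
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every row to every center.
        dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center; keep the old one if a cluster is empty.
        new_centers = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Hypothetical toy counts: three rows dominated by the first component,
# three rows dominated by the last one.
counts = np.array([
    [90, 5, 5],
    [80, 10, 10],
    [85, 10, 5],
    [5, 5, 90],
    [10, 10, 80],
    [5, 10, 85],
], dtype=float)

profiles = counts / counts.sum(axis=1, keepdims=True)  # rows on the simplex
transformed = clr(profiles)
labels, centers = kmeans(transformed, k=2)
print(labels)
```

Each CLR-transformed row sums to zero by construction. Note that with this kind of pseudocount, a profile containing an exact zero maps to a very large negative CLR coordinate and can dominate the Euclidean distances used by K-means, which is one way to see why values equal or close to zero need special care.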
